Document Recovery from Bag-of-Word Indices
Abstract
Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document’s bag-of-words (BOW) vector or other type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that if original documents are short, the documents can be recovered with high accuracy.
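As a rough illustration of the idea in the abstract, the sketch below recovers a word order from a count-based bag of words with A* search over partial orderings. It is not the authors' system: the bigram table `BIGRAM_LOGP`, the fallback score `UNSEEN_LOGP`, and the function names are hypothetical stand-ins for a real language model such as one built from the Google Web 1T 5-gram corpus, and the heuristic is just one simple admissible choice.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy bigram model (an assumption for illustration only).
BIGRAM_LOGP = {
    ("<s>", "the"): math.log(0.6),
    ("the", "cat"): math.log(0.5),
    ("the", "mat"): math.log(0.3),
    ("cat", "sat"): math.log(0.7),
    ("sat", "on"): math.log(0.8),
    ("on", "the"): math.log(0.9),
    ("mat", "</s>"): math.log(0.6),
}
UNSEEN_LOGP = math.log(1e-4)  # assumed score for any bigram not in the table


def step_cost(prev, word):
    """Negative log-probability of `word` following `prev`."""
    return -BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)


def recover(bag):
    """Return the lowest-cost ordering of the words in `bag` under the toy model."""
    counts = Counter(bag)
    if not counts:
        return []
    preds = set(counts) | {"<s>"}
    # Admissible estimates: every remaining word is charged its cheapest
    # possible predecessor, and the end marker its cheapest possible last word.
    best_in = {w: min(step_cost(p, w) for p in preds) for w in counts}
    best_end = min(step_cost(p, "</s>") for p in counts)

    def h(remaining):
        return sum(best_in[w] * c for w, c in remaining) + best_end

    start_rem = tuple(sorted(counts.items()))
    heap = [(h(start_rem), 0.0, ("<s>",), start_rem)]
    while heap:
        _, g, seq, rem = heapq.heappop(heap)
        if seq[-1] == "</s>":
            return list(seq[1:-1])  # first finished hypothesis popped is optimal
        if not rem:
            g2 = g + step_cost(seq[-1], "</s>")
            heapq.heappush(heap, (g2, g2, seq + ("</s>",), ()))
            continue
        rem_counts = dict(rem)
        for w in rem_counts:
            nxt = dict(rem_counts)
            nxt[w] -= 1
            if nxt[w] == 0:
                del nxt[w]
            nrem = tuple(sorted(nxt.items()))
            g2 = g + step_cost(seq[-1], w)
            heapq.heappush(heap, (g2 + h(nrem), g2, seq + (w,), nrem))
    return []


print(recover(["mat", "the", "on", "cat", "sat", "the"]))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

The search space grows factorially with document length, which is consistent with the abstract's observation that short documents are the ones recovered with high accuracy.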
Similar resources
A Probabilistic Topic Model Based on Local Word Relations in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
Taxonomy-based Document Clustering
One well-known document representation for text clustering is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. In this paper, Bag-Of-Queries, a new document representation, is proposed. First, a taxonomy of the terms in the local dictionary derived for the data set is extracted. Extracting taxonomy is perform...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large volume of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Learning to retrieve out-of-vocabulary words in speech recognition
Many Proper Names (PNs) are Out-Of-Vocabulary (OOV) words for speech recognition systems used to process diachronic audio data. To help recover the PNs missed by the system, relevant OOV PNs can be retrieved out of the many OOVs by exploiting the semantic context of the spoken content. In this paper, we propose two neural network models targeted to retrieve OOV PNs relevant to an audio document...
Effect of Document Representation on the Performance of Medical Document Classification
Text classification in the medical domain is a real-world problem with wide applicability. This paper extensively investigates the effect of text representation approaches on the performance of medical document classification. To accomplish this objective, we evaluated seven different approaches to represent real-world medical documents. The text representation approaches investigated in this pa...